In this section, we cover the uses of the Log, Mean, Sort, Sum, Rank, and Group data transformations on data from the U.S. Economic Census in the years 2002, 2007, and 2012. We use these transformations to create scatter plots of annual industry payroll, both overall and by state, ranked scatter plots of average annual income, and a rank swimlane visual that orders the states by revenue in each year the census was taken.
The Logarithm Data Transformation
The mathematical log(x) transformation is easily one of the most popular and widely used data transformations. It is primarily used when dealing with positively (right-) skewed data and can be applied to the dependent and/or independent variable(s), depending on the situation. Click on “Make it Linear” to view the pre-transformed data set. Looking at the 2012 U.S. Economic Census, we find data on annual industry payroll versus revenue that is indeed positively skewed, with a long right tail. We take the log of the independent and dependent variables in order to illustrate the linear trend that exists between payroll and revenue. Furthermore, looking at the density functions, there is an obvious right skew that is normalized by the log transformation.
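The visualizations in this piece are built in R; as a rough illustration of the idea, here is a minimal Python sketch with made-up payroll and revenue figures (both spanning several orders of magnitude, which is what produces the right skew described above):

```python
import math

# Hypothetical annual payroll and revenue figures (in $ millions);
# the values are made up but span several orders of magnitude,
# which produces a strong right (positive) skew.
payroll = [1.2, 15.0, 80.0, 400.0, 2500.0]
revenue = [10.0, 120.0, 700.0, 3600.0, 21000.0]

# Taking log10 of both variables compresses the long right tail
# and makes the payroll-revenue relationship roughly linear.
log_payroll = [math.log10(x) for x in payroll]
log_revenue = [math.log10(y) for y in revenue]
```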
The Mean Data Transformation
The mean transformation, mean(x), calculates the average of a vector of values. This is a useful way to compress and summarize many data entries into one meaningful value. In the example below, we used the U.S. Economic Census 2012 data to calculate the mean annual income per person in each state of the U.S. in 2012. This is a number that a viewer of our visualization can relate to.
Below is a plot of the Annual Total Payroll of each industry in each state. There are too many points per state to decipher meaningful information from this visual.
Using the mean function, we’ve been able to make a plot of the mean annual salary of each person in each state. We’ve done this by finding the mean annual payroll of all industries by state and dividing by the population of each state. Below is the result:
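The article's calculations are done in R; a minimal Python sketch of the steps just described, with a made-up payroll vector and population figure, might look like:

```python
from statistics import mean

# Hypothetical annual payrolls (in $) for the industries in one state,
# plus a made-up state population.
industry_payrolls = [4.1e9, 2.7e9, 9.3e9, 1.2e9]
population = 6_500_000

# Mean annual payroll across industries, divided by population,
# following the steps described above.
mean_income_per_person = mean(industry_payrolls) / population
```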
The Sort Data Transformation
The sort(x) function is used to rearrange the order of a vector of data entries. It is most often used to put data entries in order from greatest to least or least to greatest. This helps us to better highlight and visualize trends and to identify values that may have more meaning to us, such as which car accelerates the fastest or which brand of ramen has the most sodium.
In our example, we use the sort function to arrange states in order of average annual income per person, which lets us better understand how the values for each state relate to one another. Rather than an extremely messy plot that is difficult to read, let alone find trends in, the sorted graphic allows for comprehensive examination, in which we notice that Massachusetts and many other Northeastern states have the highest average salaries while Montana has the lowest.
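In Python terms, the sorting step above amounts to one call (state names and income values here are made up):

```python
# Hypothetical average annual incomes by state (values are made up).
income_by_state = {"MA": 61000, "MT": 39000, "CA": 57000, "TX": 48000}

# Sort states from highest to lowest average income.
ranked = sorted(income_by_state.items(), key=lambda kv: kv[1], reverse=True)
# ranked[0] is the highest-income state, ranked[-1] the lowest.
```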
The Sum Data Transformation
The sum(x) transformation is used to add up the entries of a vector and output their total. This is often a useful step in summarizing data or aiding a further transformation. In many circumstances, sum is used to summarize data with a tool like ddply in the R programming language’s plyr package.
Using data from the U.S. Economic Census in 2002, 2007, and 2012, we calculate the total revenue of each state for each of the three years. The data was originally given as the revenue of each industry in each state; however, that would be unreadable when charted and of little use when looking for trends at the state level.
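The article does this with ddply in R; a minimal Python sketch of the same sum-by-state step, with made-up figures, looks like:

```python
from collections import defaultdict

# Hypothetical (state, industry, revenue) rows, mirroring the census
# layout; the figures are made up.
rows = [("CA", "Retail", 5.0), ("CA", "Tech", 9.0),
        ("TX", "Energy", 7.0), ("TX", "Retail", 3.0)]

# Sum revenue over industries within each state.
state_totals = defaultdict(float)
for state, _industry, revenue in rows:
    state_totals[state] += revenue
```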
The Rank Data Transformation
The rank(x) function is used to assign a rank value to each data entry in a vector. This is helpful for sorting and highlighting the relative relationships between data entries.
In the example below, we use the rank transformation to assign a number 1 through 51 to the states based on their revenue. We then use this to create a time series that shows the change in rank of state revenues over the ten-year span from 2002 to 2012. With this transformation, we have been able to go from a disorganized collection of values to a meaningful visualization of how the states relate to one another based on revenue.
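A minimal Python sketch of a rank function in the style of R's rank(x), using made-up revenue values (ties are not handled in this sketch):

```python
def rank(values):
    """1-based ranks where the largest value gets rank 1 (ties not handled)."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    ranks = [0] * len(values)
    for r, i in enumerate(order, start=1):
        ranks[i] = r
    return ranks

# Made-up state revenues; the middle entry has the highest revenue.
revenues = [120.0, 880.0, 430.0]
```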
The Group Data Transformation
The group transformation, grpfun(x1,x2,…; f(x)), applies the same function to groups of elements. For example, given data across 30 buildings, we might want to sum up the total number of data entries per building. We use a group function to do this, grouping entries by building and taking the sum of each group.
In this example, we have applied the rank function, as described above, grouped by each year of data to rank the states in order of revenue for each year the census was taken. This finalizes the state revenue ranking time-series below.
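The article performs this grouped ranking in R; a minimal Python sketch of ranking within year groups, using made-up (year, state, revenue) rows, might look like:

```python
from collections import defaultdict

# Hypothetical (year, state, revenue) rows; the figures are made up.
rows = [(2002, "CA", 900), (2002, "TX", 700), (2002, "NY", 800),
        (2012, "CA", 950), (2012, "TX", 990), (2012, "NY", 910)]

# Group the rows by year.
by_year = defaultdict(list)
for year, state, revenue in rows:
    by_year[year].append((state, revenue))

# Within each year, rank states by revenue (rank 1 = highest revenue).
rank_by_year = {
    year: {state: r
           for r, (state, _rev) in enumerate(
               sorted(group, key=lambda sr: sr[1], reverse=True), start=1)}
    for year, group in by_year.items()
}
```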
In this section, we use the Lag, Difference, and Sign data transformations in the creation of a time series of Google stock prices from April 2014 to April 2015 (data taken from Yahoo Finance). The time series is color coded to display overall increase and decrease of stock price by month.
The Lag Data Transformation
The lag(x) transformation is used to shift the values of a time series down a column by one unit, replacing the first entry with ‘NA’. This is useful for finding the day-to-day (or month-to-month, year-to-year, etc.) difference in a value as it changes over time increments.
In the Google stock data, we use the lag transformation to offset the average price of the Google stock by one month, setting the first month’s average in the lag column to NA. This allows us to subtract the lag column from the original column of monthly average prices to find the change in average monthly price for each month.
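A minimal Python sketch of the lag step, using None in place of R's NA and made-up monthly prices:

```python
def lag(series):
    """Shift a series down one index, filling the first slot with None (R's NA)."""
    return [None] + series[:-1]

# Made-up monthly average stock prices.
monthly_avg = [526.0, 531.5, 571.6, 589.0]
lagged = lag(monthly_avg)  # [None, 526.0, 531.5, 571.6]
```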
The Difference Data Transformation
The diff(x1, x2) transformation is used to subtract entries of one vector from corresponding entries in another vector of the same length. For example, if I wanted to find the temperature range for a given year in different cities in the US, I would find the difference between the vector containing the maximum temperatures in each city and the vector containing the minimum temperatures in each city.
In the Google Stock Data below, we use the lag transformation to shift a column containing Google’s stock prices on the last date of each month from April 2014 - April 2015 down an index. We then use the difference transformation to subtract the lagged column from the original column.
The Sign Data Transformation
We use the sign transformation to simply determine whether each number in a vector is positive, negative, or 0 by mapping it to 1, -1, or 0 respectively. This is often helpful for identifying periods of time during which a vector is increasing or decreasing.
In the case of the Google stock data, we determine the sign of the month-to-month difference calculated above and chart whether the Google stock price is increasing or decreasing month to month. We then use this information to color a time series of the Google stock price green when it is increasing and red when it is decreasing. With a visual like this, we can search for trends in the day-to-day behavior of a stock during times when it sees an overall increase or decrease.
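Putting the three transformations together, here is a minimal Python sketch of the lag-diff-sign pipeline with made-up month-end prices (None stands in for R's NA):

```python
def lag(series):
    """Shift a series down one index; None stands in for R's NA."""
    return [None] + series[:-1]

def diff(x1, x2):
    """Element-wise x1 - x2, propagating None."""
    return [None if (a is None or b is None) else a - b
            for a, b in zip(x1, x2)]

def sign(x):
    return 0 if x == 0 else (1 if x > 0 else -1)

# Made-up month-end closing prices.
close = [526.0, 531.5, 520.0, 589.0]
change = diff(close, lag(close))  # first entry is None
colors = ["green" if sign(c) > 0 else "red"
          for c in change if c is not None]
```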
In this section, we use the prank(), rank() and cut() data transformations to create a visualization of a distribution of SAT scores. We start with a normal distribution of 10,000 scores with mean 1500 and standard deviation 330. We then color this distribution based on the percentile rank of the scores, with the lowest percentile range being the lightest color and the highest percentile range being the darkest color.
The prank() Data Transformation
The prank() data transformation is shorthand for “percentile rank”. In a vector of numerical values, prank is used to calculate the proportion of data entries that a specific data entry is greater than or equal to. For example, in the list {1,5,6,7}, the entry ‘6’ has a prank of 75% because it is greater than or equal to 3 out of 4 of the list’s elements.
In the SAT score data below, we calculate the percentile rank of each score in the distribution – that is to say, we calculate the percentage of scores that each individual score is greater than or equal to. Since the R programming language does not have a specific prank() function, we calculate prank(scores) as:
prank = (rank of a score / total number of scores) × 100
We then assign each value in the distribution its percentile rank. Below, we’ve used prank to color our distribution of SAT scores. It is now much easier to use the broad percentile-rank categories to determine how a score compares with the rest of the SAT scores in the data set. For example, if I received a 1600 on my SAT, I can see that I fall in the 50-75 percentile range and that at least 25% of people did better than me while at least 50% did worse.
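Since R lacks a built-in prank(), the article computes it from rank(); here is a minimal Python sketch of the same recipe, using the greater-than-or-equal convention from the {1,5,6,7} example above:

```python
def prank(values):
    """Percentile rank: percentage of entries each value is >= to."""
    n = len(values)
    return [100.0 * sum(1 for v in values if v <= x) / n for x in values]

# Matches the worked example above: 6 is >= 3 of the 4 entries -> 75%.
pranks = prank([1, 5, 6, 7])
```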
The Arcsine Data Transformation
The arcsine data transformation is a widely used method of normalizing proportion data (data from 0 to 1) for linear regression. It is calculated as arcsin(sqrt(x)), where x is a vector of values between 0 and 1. Though the arcsine transformation is especially popular in the analysis of proportional data, it is a controversial method, and many have argued that the Box-Cox power transformation is a better alternative.
We demonstrate the arcsine transformation here in a visualization of the proportion of total deaths that occur in each age range (infant to elderly). In the histogram and density distributions below, we can see that the post-transformation data has a more linear and normalized trend than the pre-transformation data. This allows us to more easily derive a linear regression for the proportion of deaths by age range and to potentially more consistently visualize the relation from one age range to another. We can evaluate just how effective the arcsine data transformation was by running a robust linear regression on the data. Once the regression is fitted, we’ll use the residual and influence transformations to analyze each regression.
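The arcsin(sqrt(x)) formula above is a one-liner in any language; a minimal Python sketch with made-up death-proportion values:

```python
import math

def arcsine_transform(proportions):
    """arcsin(sqrt(p)) for each proportion p in [0, 1]."""
    return [math.asin(math.sqrt(p)) for p in proportions]

# Made-up shares of total deaths falling in four age ranges.
death_proportions = [0.02, 0.05, 0.10, 0.40]
transformed = arcsine_transform(death_proportions)
```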
The Residual Data Transformation
The residual transformation is used in conjunction with a linear regression on a set of data to calculate how much distance there is between each of the discrete data values and the predicted regression values. While correlation coefficients are widely used to understand a model’s goodness of fit, residuals provide an effective way to visualize it – the larger the residuals, the farther the actual data points are from the model, and thus, the less accurate the model.
With the example of proportions of total deaths that occur in each age range, we ran a linear regression on the data before and after the arcsine transformation. We then calculated and plotted the residuals to better visualize how good a fit our linear regression was before and after the transformation. In these charts, we can see that the arcsine transformation did in fact improve the residuals and can be considered a useful transformation for this data:
The Influence Data Transformation
The influence transformation is a measure of how much impact a particular data point has on the correlation between an independent and dependent variable, i.e. the correlation coefficient (r^2 value) of a regression. It is calculated with the following formula:
influence = (r^2 without data point) - (r^2 with data point)
The influence of a data point can be helpful in identifying which points in a data set are helpful to the trend found in a specific regression and which points are decreasing its goodness of fit.
Here we calculate and visualize the influence of each data point. The size of each data point is proportional to its influence and provides an easy way to visualize the difference between the non-transformed and arcsine-transformed data. It becomes clear that the non-transformed data points have influences in the range of (-.1,.1) while the arcsine transformation creates the influence range (0,.2) – i.e. each point in the post-arcsine-transformation regression improves the r^2 coefficient, and thus, improves the fit of the regression line. We notice that this is not the case pre-transformation, fortifying the utility of the arcsine transformation.
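A minimal Python sketch of the leave-one-out influence formula given above, with made-up (x, y) points; here a positive influence means r^2 is higher when the point is removed:

```python
def r_squared(xs, ys):
    """Squared Pearson correlation between two equal-length vectors."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy * sxy / (sxx * syy)

def influence(xs, ys):
    """Per the formula above: r^2 without each point minus r^2 with it."""
    base = r_squared(xs, ys)
    return [r_squared(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:]) - base
            for i in range(len(xs))]

# Made-up points: the last one sits far off an otherwise linear trend,
# so removing it raises r^2 and it receives a large positive influence.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [1.1, 2.0, 2.9, 4.2, 2.0]
inf = influence(xs, ys)
```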
The ‘Miss’ Data Transformation
The miss() transformation is used to impute missing data. That is to say, where data points are missing from a data set, miss predicts what those values should be and fills them in. One of the most common ways to implement a miss transformation is to run a regression and replace all NA values with the values predicted by the regression. The simplest form of this is to use a linear regression; however, there are other models that can help account for randomness in data by taking errors into account (see Gelman, Ch. 25). All values that were previously present remain unaltered.
Here we run a miss transformation on the Deaths24 variable (deaths that occur in the age range 90-94 yrs), which contains roughly 15,000 missing values, in order to reanalyze the proportions of deaths in each age range. By linearly regressing Deaths24 on Deaths2-Deaths23, we predict the missing values and impute them. The r^2 coefficient for this regression is .99, so we can assume that the predicted values fit well. Once the values of Deaths24 are imputed, we recalculate the sum of total deaths to reconfigure the proportions. As we can see, this alters the shape of the distribution and could cause one to question the validity of omitting missing data in this case.
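The article's imputation regresses on many predictor columns in R; as a simplified illustration, here is a Python sketch with a single made-up predictor and a gap in a stand-in for Deaths24:

```python
def fit_line(xs, ys):
    """Least-squares intercept and slope."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    return my - b * mx, b

def miss(xs, ys):
    """Fill None entries of ys with predictions from a line fit on complete pairs."""
    pairs = [(x, y) for x, y in zip(xs, ys) if y is not None]
    a, b = fit_line([x for x, _ in pairs], [y for _, y in pairs])
    return [a + b * x if y is None else y for x, y in zip(xs, ys)]

# Made-up stand-ins for one predictor column and Deaths24 with a gap.
predictor = [10.0, 20.0, 30.0, 40.0]
deaths24 = [12.0, 22.0, None, 42.0]
filled = miss(predictor, deaths24)  # observed values stay untouched
```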
Median Transformation
The median is one of the most commonly used properties of a data set. It describes the value that is positioned exactly in the middle of the set. The median is a valuable piece of data because it allows us to see the exact midpoint of a data set at which 50% of the data falls below the median and 50% of the data lies above it. In particular, people value the median because it is not influenced by outliers, as the mean of a data set often is.
In this application of the median, we use a box plot time series to look at the change in the world’s median infant mortality rate over time. This allows us to clearly visualize a decreasing trend in infant mortality rate, with key improvements coming around the year 2000. In particular, we notice not only that the median is decreasing over time, but also that the quartiles surrounding the median decrease as well, demonstrating that the median infant mortality rate is not only improving, but that the values surrounding the median are also improving.
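The outlier-resistance mentioned above is easy to demonstrate; a tiny Python sketch with made-up infant mortality rates:

```python
from statistics import median

# Made-up infant mortality rates (deaths per 1,000 live births).
rates = [4.2, 6.8, 55.1, 12.3, 30.0]
mid = median(rates)

# Robustness to outliers: inflating the largest value leaves the
# median unchanged, unlike the mean.
mid_with_outlier = median([4.2, 6.8, 200.0, 12.3, 30.0])
```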
Mode Data Transformation
The mode describes the data value that appears the most times in a data set. The mode is helpful for identifying which value of a set of values applies to the largest portion of the sample. For example, the winner of an election is the mode of the set of votes, because the winner received the most votes.
With the WHO mortality data, we looked at the death counts by age in each country and identified the age at which most of each country’s deaths occurred. We present the result as a heat map, with the hue representing the most common age range of death in each country. This allows us to view geographic trends, such as in which countries the most common age of death is high and in which countries the most common age of death falls in infancy. One of the most striking takeaways from this map is the fact that the mode of South African mortality is 30-34. Upon further research, we’ve found that this is due to the prevalence of tuberculosis in the region.
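The per-country modal age range can be sketched in Python with made-up death counts (the country names and figures here are hypothetical, not WHO data):

```python
from collections import Counter

# Made-up death counts per age range for two hypothetical countries.
deaths = {
    "CountryA": {"0-4": 900, "30-34": 300, "70-74": 650},
    "CountryB": {"0-4": 200, "30-34": 150, "70-74": 800},
}

# Modal age range of death per country: the range with the largest count.
modal_age = {country: Counter(counts).most_common(1)[0][0]
             for country, counts in deaths.items()}
```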
Quotient Data Transformation
The quotient(x1,x2) data transformation divides elements in a vector x1 by its corresponding elements in another vector, x2. The vectors x1 and x2 must be the same length for the quotient function to be valid. Often, quotient is used to find the number of items per category, for example, number of trees per square foot or number of computers per household. The quotient data transformation is also often a step along the way to a visualization, rather than an end goal to be visualized.
In the WHO data for mortality, we use the quotient function to find the infant mortality rate in each country by dividing the number of infant deaths by the total infant population:
infant mortality rate = (number of infant deaths) / (total infant population)
We are then able to cut these rates into bins to create a heat map displaying a different color for each range of infant mortality rates.
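A minimal Python sketch of the quotient transformation applied to the rate formula above, with made-up counts for three countries:

```python
def quotient(x1, x2):
    """Element-wise division of two equal-length vectors."""
    if len(x1) != len(x2):
        raise ValueError("vectors must be the same length")
    return [a / b for a, b in zip(x1, x2)]

# Made-up infant death counts and infant populations for three countries.
infant_deaths = [500.0, 120.0, 4000.0]
infant_population = [100000.0, 80000.0, 200000.0]
rates = quotient(infant_deaths, infant_population)
```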
Cut Data Transformation
The cut data transformation takes sorted data and cuts it into n ranges of values. This is helpful for taking many values that may be difficult to compare and consolidating them into a small number of ranges that can hold more immediate meaning in a visualization. The most common use of a cut function is to make bins in a histogram: each bin has a lower bound and an upper bound and a certain number of data entries populate it.
Here, we use the cut function to divide the infant mortality rates we calculated above into four ranges. We then use these ranges to color countries in a heat map based on which of the four ranges their infant mortality rate falls into.
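A minimal Python sketch of a cut function in the spirit of R's cut(), binning made-up mortality rates into four ranges (one map color per range):

```python
def cut(values, breaks):
    """Assign each value the index of the half-open bin [b_i, b_i+1) it falls in."""
    bins = []
    for v in values:
        for i in range(len(breaks) - 1):
            if breaks[i] <= v < breaks[i + 1]:
                bins.append(i)
                break
        else:
            bins.append(None)  # value falls outside every break range
    return bins

# Made-up infant mortality rates and break points defining four ranges.
rates = [0.002, 0.011, 0.045, 0.090]
breaks = [0.0, 0.01, 0.03, 0.06, 1.0]
bins = cut(rates, breaks)
```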